A high-quality, long-read genome assembly of the endangered ring-tailed lemur (Lemur catta)

Abstract Background: The ring-tailed lemur (Lemur catta) is a charismatic strepsirrhine primate endemic to Madagascar. These lemurs are of particular interest, given their status as a flagship species and widespread publicity in the popular media. Unfortunately, a recent population decline has resulted in the census population decreasing to <2,500 individuals in the wild, and the species' classification as Endangered by the IUCN. As is the case for most strepsirrhine primates, only a limited amount of genomic research has been conducted on L. catta, in part owing to the lack of genomic resources. Results: We generated a new high-quality reference genome assembly for L. catta (mLemCat1) that conforms to the standards of the Vertebrate Genomes Project. This new long-read assembly is composed of Pacific Biosciences continuous long reads (CLR data), Optical Mapping Bionano reads, Arima HiC data, and 10X linked reads. The contiguity and completeness of the assembly are extremely high, with scaffold and contig N50 values of 90.982 and 10.570 Mb, respectively. Additionally, when compared to other high-quality primate assemblies, L. catta has the lowest reported number of Alu elements, which results predominantly from a lack of AluS and AluY elements. Conclusions: mLemCat1 is an excellent genomic resource not only for the ring-tailed lemur community, but also for other members of the Lemuridae family, and is the first very long read assembly for a strepsirrhine.

1. In the introduction section, besides the background on the distribution and taxonomy of ring-tailed lemurs, more information would be appreciated, including their phylogenetic position and biological background such as diet, behavior, and so on.
Thank you for this suggestion. We have extended the introductory paragraph to include the following text about ring-tailed lemur ecology and phylogenetic positioning.
"Ring-tailed lemurs are medium-bodied, ecologically flexible members of the Lemuridae family and the sole member of the genus Lemur. In contrast to most other Lemuridae, L. catta predominantly inhabit the dry and seasonal forests of southern Madagascar [1]. They consume an omnivorous diet mostly of fruit and leaves, and engage in a multi-male multi-female social structure with a polygynandrous mating system [1]."
2. During the de novo assembly and subsequent analysis, the authors use several different software packages for their analysis. However, the specific parameter settings for the software used were not given.
Thank you for drawing our attention to this issue, which we have now clarified in the text and in Additional File 1. In order to keep the text concise, we had not listed every parameter and setting explicitly. However, we have now included a link to the VGP master pipeline in the "De novo assembly" section, which provides these details. All scripts and parameters used for the assembly can be found in the VGP GitHub repository, from which our pipeline is derived. The corresponding websites will be listed in Additional File 1.
The remaining software specific parameters are now present in the text. All RepeatMasker analyses are embedded in the text and commands have been added to Additional File 1.
BUSCO parameters are also specified in the text and commands have been added to Additional File 1.
The MITOS2 server ran the annotation of the mitogenome with the default parameters.
3. The detailed scaffolding step for the Arima Hi-C data with Salsa 2.2 [18] is also missing. How did the authors deal with the sequence order? This information could help us understand how the authors addressed technical issues such as the orientation of inverted regions within the scaffolds.
Thanks for pointing out this matter. We did not consider the sequence order specifically, but we suggest that these technical issues should not cause substantial problems for our assembly: the contigs we assembled are exceptionally long, and we used two types of scaffolding data, which makes the types of errors proposed by the reviewer unlikely to affect our assembly. Specifically, the SALSA2 paper [2] explains that short contigs lead to a higher number of misoriented contigs within scaffolds, and shows that SALSA2 outperforms its previous version in this regard.
4. The gapless mitochondrial genome was assembled from PacBio long reads and 10X short reads, and was annotated using the MITOS2 web server. Short sequencing reads are typically used for mitochondrial genome assembly. Please explain why both long reads and short reads were chosen for the assembly, and whether this combined strategy presents any advantages compared with the traditional method. In addition, the MITOS2 web server was employed to annotate the mitogenome, but a description of the procedure could not be found. Details on how the annotated genes and regions were reordered and concatenated would be appreciated.
The reviewer raises an important point, and we should have been more clear about it in the manuscript. Details regarding the mitogenome assembly process were recently published (Formenti et al. 2021) as part of the broader mitoVGP pipeline, which we have now clarified and cited. The advantage of our combined short-and-long read strategy is that the highly repetitive nature of the mitochondrial control region (CR) sometimes does not allow for complete error-free assemblies of the mitogenome using short-read data alone. In this specific case there is a small repeat region which is correctly assembled using both long and short reads to obtain the complete mitogenome. We have added the corresponding explanation and citation in the main text.
For the annotation we used MITOS2, a web server that easily annotates genes and regions of any mitochondrial genome. Further details on the procedures can be found in [3], and the corresponding github repository (https://github.com/gavieira/mitos2_wrapper), where you will find the code and specifics of the software, which we did not modify.
5. Please format the references into same style. For example, in reference 19, vs. reference 20. Please revise all "Lemur catta" into italic. Please check and revise according to the policy of GigaScience.
We apologize for this oversight. All references have now been correctly formatted according to GigaScience policy using reference software.
6. Did the authors confuse the order of Figure 3 and Figure 4?
We apologize for the confusion. We have now reordered the figures during the submission process of the manuscript.

Reviewer #2:
This is a great work conducting genome assembly of this primate species. The assembly would highly benefit from the annotation of the genome (gene annotation) using RNA-seq data; however, this seems to be beyond the goals of this manuscript. Since the focus of the study is on the genome assembly, it would be helpful to conduct a chromosome synteny analysis with the human genome and other primate species to give a big picture of the differences across the species. Below, please find the comments on this work.
We very much appreciate the reviewer's kind response and positive assessment. The suggestion of a chromosome synteny analysis is an excellent one, which is described below.
Abstract: Continuous Long Read (CLR) NOT (CLR Reads)? Isn't the word "Read" already included in the abbreviation? Not sure what the standard abbreviation for this term is, and whether it really needs the word "Read".
Thank you for pointing out this oversight. We have adjusted both references from "(CLR reads)" to "(CLR data)". "CLR reads" is a commonly used expression in the field, but we agreed that changing it to "CLR data" makes more sense.
Data Description: * Any data on the quality of HMW DNA evaluation? Would be good to cite this data in the first paragraph of the Data Description where the authors mention HMW DNA quality control.
We used a PFGE gel (Sage Pippin Pulse) as a HMW quality control measure. We have added the corresponding image as a supplementary figure ("Figure S1: Pulsed Field Gel assay (Sage Pippin Pulse) with HMW ladder used for quality control of the ultra-High Molecular Weight DNA (Lemur catta is in well number 1)") and the corresponding text ("uHMW DNA quality was assessed by a Pulsed Field Gel assay and quantified with a Qubit 2 Fluorometer (Figure S1)") in the manuscript as suggested.
* Would be great for the authors to report the results of repeat analysis using Repeat Modeler.
Thank you for this suggestion, which we have given substantial consideration. We decided to run RepeatMasker exclusively for several reasons, but primarily because there is a high likelihood that a comparable outcome would be produced by RepeatModeler. Additionally, RepeatModeler's results would lack power for comparison between species, because they depend directly on the quality of the assemblies used. More specifically, running RepeatModeler requires the use of a previously established repeat library in order to classify the repeats present in the focal genome and obtain a specific database of repeats for the genome masking. The standard library in this case would be Dfam, which is also what we used for our RepeatMasker run to classify repeats. We suggest that the well-established primates database provided by RepeatMasker, which is derived from a larger number of genomes, is an appropriate choice for the masking of a lemur genome; thus, we are more confident in our results than we otherwise would be by creating a new database based solely on the present genome. Secondly, RepeatMasker is a well-known and commonly used software and the already complete database it provides will allow for more systematic comparison and analysis by other researchers. Creating a database based on the Lemur catta genome alone could help to find specific repeat patterns within the species, but ultimately, it would still be based on the same previously known library of repeats that RepeatMasker uses to classify them. As such, we think that the computationally demanding step of running RepeatModeler would not substantially alter our results.
* Any Synteny analysis compared to other primate species? One of the most useful information from a long-read sequencing (and chromosome-level assembly) is the ability to compare the chromosomal synteny with other primates (or just with humans).
We thank the reviewer for drawing our attention to this issue, and agree that this assembly can be a powerful tool for chromosomal comparison and for finding syntenies between Lemur catta and other species. For this purpose, we performed a synteny analysis, creating a dot plot with the nucmer -mum option of MUMmer v3.23, and visualized the synteny between the present assembly of Lemur catta (mLemCat1) and an assembly of Homo sapiens (hg38) using the https://dot.sandbox.bio/ website. We have added this synteny plot as a supplementary figure and the following text to the manuscript: "The present assembly (mLemCat1) can be useful to create synteny plots between L. catta and others, such as humans (Figure S2), as it has N50 statistics comparable to other high-quality primate genomes recently published (Table S2)." We added Figure S2, an overall chromosomal synteny plot between Lemur catta (mLemCat1 assembly) and Homo sapiens (hg38 assembly), to the supplementary material file.
* What is the number of scaffolds that cover 90% of the genome? How different is this number (the number of scaffolds that cover 90% of the genome) compared to the number of chromosomes for this species? Also, what about N95? Would be good to discuss these statistics more clearly to give a clearer picture of the assembly.
The reviewer raises a good point, which we should have been more clear about in the text. We agree that N50/90/95 and L50/90/95 are important statistics for evaluating a genome assembly. In order to keep the text concise, we have adjusted the manuscript to include both N/L50 and N/L95, and include the additional N/L90 values in the supplemental materials, given their similarity to the N/L95 values. The number of scaffolds that cover 90% of the genome is 24, which is 3 more than that found in the hg38 human assembly (L90 = 21). Regarding the L95 and N95 values, we see a similar trend: mLemCat1 L95 = 28 and N95 = 21.9 Mb; hg38 L95 = 24 and N95 = 46.7 Mb. As the Lemur catta genome is about two-thirds the size of the human genome, these contiguity values are similar. Additionally, the expected haploid number of chromosomes [4] in Lemur catta, 29 reference chromosomes (27 autosomes + 2 sex chromosomes), is larger than the 22 autosomes and 2 sex chromosomes of humans. We have added the following parameters in Table S1:
* What other primate species genomes were recently assembled at "chromosome-level assembly" similar to this study, and how does the scaffold N50 of other recent primate genome assemblies differ from (or resemble) the scaffold N50 of this assembly? Would be good to mention in the discussion section. There are a few other recent assemblies of primates in GigaScience (over the last 2 years) using similar methods.
Thank you for this suggestion. As the reviewer rightly mentions, there are other recent primate assemblies published in GigaScience that are valuable points of comparison. While searching the GigaScience website for published chromosome-level genomes from the past two years, we were able to identify three such assemblies: Ma2, Panubis1.0, and ASM756505v1. mLemCat1 has a slightly smaller scaffold N50 compared to these recently published primate genomes. However, the size of our Lemur catta genome assembly is at least 25% smaller than the other genome assemblies used for this comparison, which explains its proportionally smaller N50 value. We have added Table S5 (Comparison of scaffold N50 and assembly size of the latest primate genomes published in GigaScience) to the supplementary materials and the corresponding text to the discussion.
* Repeat analysis would benefit from running 'repeat modeler' in addition to existing analysis.
As we have detailed above, we are confident in our RepeatMasker results and contend that the additional run of Repeat Modeler could lead to additional complications.

Context:
The strepsirrhines are a remarkably diverse radiation of primates that includes more than one quarter of all recognized primate species [1]. The vast majority of strepsirrhines (103 species) are members of the Lemuroidea, colloquially known as "lemurs", and endemic to Madagascar.
Despite their geographic isolation, the lemur radiation is exceptionally diverse, including both the smallest living primate (Microcebus berthae) and one of the largest (the recently extinct subfossil lemur, Archaeoindris fontoynontii) [2,3]. Although lemurs are highly diverse, they are comparatively understudied relative to other primates, and ~87% of species are threatened with extinction, raising major conservation challenges [1].
Of particular interest, both ecologically and in the public imagination, are ring-tailed lemurs (Lemur catta, NCBI:txid9447). Ring-tailed lemurs are medium-bodied, ecologically flexible members of the Lemuridae family and the sole member of the genus Lemur [4][5][6]. In contrast to most other Lemuridae, L. catta predominantly inhabit the dry and seasonal forests of southern Madagascar [7]. They consume an omnivorous diet mostly of fruit and leaves, and engage in a multi-male multi-female social structure with a polygynandrous mating system [7]. Ring-tailed lemurs are under severe conservation pressure; they are classified as Endangered by the IUCN [8], resulting primarily from deforestation, hunting, and capture for the pet trade. A recent population census has revealed a dramatic population decline with as few as 2200 individuals remaining in the wild [9]. Of further concern, the species is distributed across a highly fragmented range with only eight populations of at least 100 individuals remaining [9]. Despite this near-term population decline, a recent microsatellite analysis indicates that the genetic diversity of L. catta populations could be exceptionally high, with evidence of genetic isolation by distance throughout their geographic range [6].
From a genomic perspective, relatively little is known about ring-tailed lemurs (and strepsirrhines more broadly). Genome assemblies have been published for 18 strepsirrhine species, but none of these assemblies has a contig N50 value above 1 Mb, and only three of them are above 100 kb [10]. Recently, a Lemur catta genome (LemCat_v1_BIUU) was assembled by the Zoonomia consortium [11]; however, because it is derived from Illumina short reads, its metrics and applications are still limited compared to recent highly contiguous assemblies [12]. This general lack of genomic resources remains a considerable limitation for the comparative and population genomics of lemurs.
Here, we present a new high-quality genome assembly of L. catta (mLemCat1) that conforms to the standards of the Vertebrate Genomes Project (VGP). mLemCat1 was assembled with a combination of PacBio continuous long reads (CLR data), Optical Mapping Bionano reads, Arima HiC data, and 10X linked-reads. Our new assembly will allow for a deep assessment of the genome biology and conservation genomics of endangered ring-tailed lemurs. Additionally, given the paucity of high-contiguity strepsirrhine assemblies, it will allow major advances in genomics across the Lemuridae family.
Gbp from the GoaT database [13], the present genome (mLemCat1) has 86.43X of 10.28X linked-reads data, 66.68X of Arima data, 154.57X of Bionano data and 62.88X of PacBio data.
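The fold-coverage values quoted above can be understood as the total number of sequenced bases divided by the expected genome size. A minimal sketch of that arithmetic follows; the function name, the genome size, and the data yields are hypothetical placeholders chosen for illustration and are not values from this study.

```python
def fold_coverage(total_bases: int, genome_size_bp: int) -> float:
    """Return sequencing depth as total sequenced bases / expected genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


# Hypothetical inputs for illustration only (not taken from the manuscript).
EXAMPLE_GENOME_SIZE_BP = 2_000_000_000
example_yields_bp = {
    "pacbio_clr": 120_000_000_000,
    "linked_reads": 160_000_000_000,
}

example_coverage = {
    name: fold_coverage(bases, EXAMPLE_GENOME_SIZE_BP)
    for name, bases in example_yields_bp.items()
}

for name, depth in example_coverage.items():
    print(f"{name}: {depth:.2f}X")
```

Because the denominator is the expected genome size rather than the assembled span, the same data set yields a slightly different coverage figure if a different size estimate is used.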
Purged contigs were removed from the primary assembly and added to the alternate assembly.
The primary scaffolds, alternate contigs, and mitochondrial assembly were polished simultaneously. We first performed polishing and gap filling with the original PacBio data using Arrow [14], followed by two rounds of short-read polishing using the 10X linked-reads data.

Genome Quality Assessment
Compared to the currently available short-read Lemur catta genome (LemCat_v1_BIUU) [11], the new mLemCat1 assembly has higher contiguity values, fewer scaffolds, and a slightly smaller assembly size (Table 1). We generated basic contiguity metrics for both assemblies using QUAST V5.0.2 (QUAST, RRID:SCR_001228) [23], which are presented in Table 1. The assembly has a total scaffold size of 2.122 Gb within 141 scaffolds.  [24]. Further comparison can be found in Table S1. The overall GC content of this assembly is 40.48%.
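For readers less familiar with the contiguity statistics reported here, Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly, and Lx is the number of sequences in that set. The following is a minimal sketch of that definition, not the QUAST implementation; the function name and the toy scaffold lengths are assumptions for illustration.

```python
def nx_lx(lengths, x=50):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx: length of the sequence at which the cumulative sum of lengths,
        taken from longest to shortest, first reaches x% of the total.
    Lx: number of sequences needed to reach that point.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches the total")


# Toy scaffold lengths (hypothetical, arbitrary units).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(toy_lengths, 50))
print(nx_lx(toy_lengths, 90))
```

With this definition, a lower Lx at a similar assembly size indicates that fewer, longer scaffolds carry most of the sequence.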
The mLemCat1 assembly has a high level of accuracy and completeness that conforms to the proposed standards of the VGP [12].  number.
In order to assess the functional completeness of the assembly, we recovered BUSCO genes from both mLemCat1 and the existing Illumina-based assembly (LemCat_v1_BIUU) (Figure 2).
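A comparison of this kind can be made by reading the one-line summary that BUSCO reports for each assembly. The sketch below assumes the commonly seen summary notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the parser name and both example strings are hypothetical and do not reproduce the results of this study.

```python
import re

# Assumed one-line BUSCO summary notation; newer versions may append extra fields.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Parse a BUSCO one-line summary into percentages and the gene count n."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    result = {key: float(match[key]) for key in ("C", "S", "D", "F", "M")}
    result["n"] = int(match["n"])
    return result


# Hypothetical summaries for two assemblies (illustrative values only).
new_assembly = parse_busco_summary("C:92.0%[S:90.0%,D:2.0%],F:3.0%,M:5.0%,n:13780")
old_assembly = parse_busco_summary("C:81.0%[S:80.0%,D:1.0%],F:9.0%,M:10.0%,n:13780")

gain_single_copy = new_assembly["S"] - old_assembly["S"]
print(f"Gain in complete single-copy BUSCOs: {gain_single_copy:.1f} percentage points")
```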
The present assembly (mLemCat1) can be useful to create synteny plots between the present species and others, such as humans (Figure S2), as it has N50 statistics comparable to other high-quality primate genomes, like the pig-tailed macaque [27], olive baboon [28] and golden snub-nosed monkey [29] genome assemblies that have been recently published (Table S2).

Mitogenome of L. catta
We assembled a gapless mitochondrial genome with a span of 17086 bp from both PacBio CLR (long reads) and 10X data (short reads) using MitoVGP v2.2 with additional parameters "-f 18000 -v LENIENT", as described in the Additional File 2: Table S3 of the MitoVGP paper [20], and annotated the assembly using the MITOS2 web server [30]. With the annotation results we plotted a map of the mitochondrion with GenomeVx [31] (Figure S3). Thirteen main protein-coding genes have been annotated in this new mitogenome: nad1, nad2, nad3, nad4, nad4L, nad5, nad6, cox1, cox2, cox3, atp6, atp8 and cob.

Analysis of the repeatome
To assess the structure and variety of repeat elements in the L. catta genome, we analyzed  Table S3). In general terms, the portion of the genome that comprises repetitive elements is similar to that which has been reported for other high-quality catarrhine genomes [32,33], although there are fewer satellites (0.30%), simple repeats (0.68%) and low complexity elements (0.13%) (Table S3).
In comparison with the previous Illumina-only assembly (LemCat_v1_BIUU) we observed minor differences in the structure and variety of repeat elements (Figure 4). The new long-read based assembly has 1.31% more interspersed repeats (50.32% vs 49.01%), and a higher percentage of sequence in each repeat subtype, except for satellites, simple repeats, low complexity elements, and ERV classes I & II. We also observed both a lower percentage of sequence and a smaller number of Alu events in mLemCat1. Additionally, the total number of masked bases is lower in the new assembly, but they represent a higher percentage of the sequence, because mLemCat1 has a shorter span.
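The last point, that fewer masked bases can still make up a larger share of the sequence, is simply a matter of the denominator. A short sketch with hypothetical numbers (not the values from Table S3) illustrates this:

```python
def masked_percentage(masked_bp: int, assembly_span_bp: int) -> float:
    """Percentage of an assembly's span that is masked as repeats."""
    if assembly_span_bp <= 0:
        raise ValueError("assembly_span_bp must be positive")
    return 100 * masked_bp / assembly_span_bp


# Hypothetical values chosen only to illustrate the effect.
older_masked_bp, older_span_bp = 1_150_000_000, 2_300_000_000
newer_masked_bp, newer_span_bp = 1_100_000_000, 2_100_000_000

older_pct = masked_percentage(older_masked_bp, older_span_bp)
newer_pct = masked_percentage(newer_masked_bp, newer_span_bp)

print(f"older assembly: {older_masked_bp} masked bp, {older_pct:.2f}% of span")
print(f"newer assembly: {newer_masked_bp} masked bp, {newer_pct:.2f}% of span")
```

In this toy case the newer assembly has fewer masked bases in absolute terms, yet a higher masked percentage, because its total span shrank proportionally more.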
Alus are the most abundant repeat elements in the human genome, and differences in their rates, distribution, and proliferation could have led to distinct functional changes in multiple primate lineages [34]. Alu elements have been present since the earliest stages of primate evolution, are frequently located in gene-rich regions, and may have an important role in gene regulation [35][36][37][38]. In order to compare the Alu repeat landscape of L. catta with those of other highly contiguous primate assemblies, we ran RepeatMasker as above adding the -alu option. The genomes used for the comparison were long-read based assemblies, including human (hg38), chimpanzee (panTro6), western gorilla (gorGor6), Sumatran orangutan (ponAbe3), rhesus macaque (rheMac10), common marmoset (calJac4), and gray mouse lemur (Mmur_3.0) (Table S4).
We identified substantially fewer Alu elements in the lemur genomes (L. catta and Microcebus murinus) than those of the catarrhines, with the fewest being found in the L. catta genome (3.66% of repeat elements) (Figure 3B, Table S5). In contrast to the other primates assessed, for which AluS elements are most abundant, AluJ is the most common element in mLemCat1 (54.17% of Alu events). Both lemurs have fewer AluS events than the anthropoids and fewer AluY events than the catarrhines, consistent with previous reports of the expansion of these two families after the Catarrhini-Strepsirrhini split [39]. The fact that the common marmoset has the highest number of AluS elements (Figure 3C) confirms that the burst that started before the Catarrhini and Platyrrhini parvorders diverged continued with different activity in both lineages after their split.
Recent Alu activity (AluY events) is most abundant in catarrhines, particularly the rhesus macaque, which, when compared to great apes (Figure 3C), has a higher overall percentage of Alus (Figure 3B).
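Subfamily proportions such as those discussed above can be tallied from the whitespace-delimited RepeatMasker `.out` table, in which the tenth and eleventh columns hold the repeat name and its class/family. The sketch below is one possible way to do this under that assumption and is not the procedure used in this study; the function names and the example rows are hypothetical.

```python
from collections import Counter

ALU_SUBFAMILY_PREFIXES = ("AluJ", "AluS", "AluY")


def tally_alu_subfamilies(lines):
    """Count Alu hits per subfamily from RepeatMasker .out-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        repeat_name, repeat_family = fields[9], fields[10]
        if repeat_family != "SINE/Alu":
            continue
        for prefix in ALU_SUBFAMILY_PREFIXES:
            if repeat_name.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            counts["other"] += 1
    return counts


def as_percentages(counts):
    """Convert subfamily counts to percentages of all Alu hits."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()} if total else {}


# Hypothetical rows in the assumed column layout (illustration only).
example_rows = [
    "   SW   perc perc perc  query  position in query  matching  repeat",
    " 2418 10.5 0.3 0.0 scaffold_1 1000 1300 (5000) + AluJb SINE/Alu 1 301 (11) 1",
    " 2200  8.1 0.0 0.0 scaffold_1 2000 2290 (4010) C AluSx SINE/Alu (22) 290 1 2",
    "  900 20.0 1.0 0.5 scaffold_1 3000 3200 (3100) + L1PA3 LINE/L1 10 210 (5900) 3",
    " 2300  9.9 0.0 0.3 scaffold_2  500  795 (9000) + AluJo SINE/Alu 1 296 (16) 4",
]

alu_counts = tally_alu_subfamilies(example_rows)
alu_percentages = as_percentages(alu_counts)
print(alu_counts)
print(alu_percentages)
```

Counting hits rather than masked bases mirrors the "Alu events" wording used in the text; a length-weighted tally would instead sum the query interval of each row.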

CONCLUSION
We have assembled a new high-quality genome reference for the ring-tailed lemur (L. catta) that satisfies the VGP assembly quality standards. Compared to pre-existing genomic resources, the new assembly has higher contiguity and completeness, and contains more single-copy complete BUSCO genes with fewer fragmented or missing genes. Additionally, we analyzed the L. catta repeatome and observed substantially fewer Alu events compared to other high-quality primate assemblies. This assembly illustrates how long reads and further scaffolding data such as HiC or optical mapping can drastically improve the contiguity and completeness of an assembly, which also allows for improved analysis of structural variation. We suggest that this new assembly will be an excellent resource for the mammalian genomics community, with particular value for the conservation genomics of lemurs.

DATA AVAILABILITY
The raw sequencing data and assembly are available via NCBI BioProject: PRJNA562215.  [40].

ADDITIONAL FILE
Additional file 1: Links to the websites with the assembly pipeline specifics used to create Lemur catta (mLemCat1) genome assembly and command lines used to perform the different analyses.

AUTHOR CONTRIBUTIONS
MPF and JDO analyzed the data; JM and BH generated the data; BH generated the draft assembly; MFB collected the samples. MPF and JDO wrote the paper with contributions from all authors. TMB, EFJ, and OF designed the research.

Figure 2: BUSCO Assessment Results comparison between mLemCat1 and LemCat_v1_BIUU
Lemur catta assemblies using the Primates_ODB10 database (n = 13780). The new mLemCat1 assembly shows a 7.3% increase in complete single copy orthologous genes.